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Abstract 

This  paper  gives  an  overview  of  the  implementation  of  Nesl,  a  portable  nested  data-parallel  language. 
This  language  and  its  implementation  are  the  first  to  fully  support  nested  data  structures  as  well  as  nested 
data-parallel  function  calls.  These  features  allow  the  concise  description  of  parallel  algorithms  on  irregular 
data  structures,  such  as  sparse  matrices  and  grr^hs.  In  addition,  they  maintain  the  advantages  of  data-parailel 
languages:  a  simple  ptogranuning  model  and  portability.  The  current  NESL  implementation  is  based  on  an 
intermediate  language  called  VCODE  and  a  library  of  vector  routines  called  CVL.  It  runs  on  the  Connection 
Machines  CM-2  and  CM-S,  the  Cray  C90,  and  serial  workstations.  We  compare  initial  benchmark  results 
of  Nesl  with  those  of  machine-specific  code  on  these  machines  for  three  algorithms:  least-squares  line¬ 
fitting,  median  finding,  and  a  sparse-matrix  vector  product  These  results  show  that  Nesl’s  performance 
is  competitive  with  that  of  machine-specific  code  for  regular  dense  data,  and  is  often  superior  for  irregular 
data. 
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1  Introduction 


The  high  cost  of  lewritiog  parallel  code  has  resulted  in  significant  effort  directed  toward  developing  iiigh- 
level  languages  that  are  efficiently  portable  among  parallel  and  vector  supercomputers.  A  common  approach 
has  been  to  add  data-parallel  operations  to  existing  languages,  as  exemplified  by  the  High  Performance 
Rxtran  (HPF)  effort  [33J  and  various  extensions  to  C  (such  as  C*  [49, 47],  UC  [5],  and  C**  [38]).  Such 
data-parallel  extensions  offer  fine-grained  parallelism  and  a  simple  programming  model,  while  permitting 
efficient  implementation  pn  SIMD,  MIMD,  and  vector  machines.  On  the  other  hand,  it  is  generally  agreed 
that  although  these  language  extensions  are  well-suited  for  computations  on  dense  matrices  or  regular 
meshes,  they  ate  not  as  well-suited  for  algorithms  that  operate  on  irregular  structures,  such  as  unstructured 
sparse  matrices,  graphs,  or  trees  [29].  Languages  with  control-parallel  constructs  are  often  better  suited  for 
such  problems,  but  unfortunately  these  constructs  do  not  port  well  to  vector  machines,  SIMD  machines,  or 
MIMD  machines  with  vector  prcccssors. 

Nesteddata~parallellanguages[lQ]  combine  aspects  of  both  data-parallel  and  control-parallel  languages. 
Nested  data-parallel  languages  provide  hierarchical  data  structures  in  which  elements  of  an  aggregate  data 
structure  may  themselves  be  aggregates,  and  support  the  parallel  application  of  parallel  functions  to  multiple 
sets  of  data.  For  example,  a  sparse  array  can  be  repr^ented  as  a  sequence  of  rows,  each  of  which  is  a 
subsequence  containing  the  nonzero  elements  in  that  row  (each  subsequence  may  be  a  different  length). 
A  parallel  function  that  sums  the  elements  of  a  sequence  can  be  applied  in  parallel  to  sum  each  row  of 
this  sparse  matrix.  Because  the  calls  are  to  the  same  parallel  function,  a  technique  called  flattening  nested 
parallelism  [17]  allows  a  compiler  to  convert  them  into  a  form  that  runs  efficiently  on  vector  and  SIMD 
machines.  Nested  data-parallel  languages  therefore,  in  theory,  maintain  the  advantages  of  data-parallel 
languages  (fine-grained  parallelism,  a  simple  programming  model,  and  portability)  while  being  well-suited 
fix’  describing  algorithms  on  irregular  data  structures.  Their  efficient  implementation,  however,  has  not 
previously  been  demonstrated. 

As  part  of  the  Carnegie  Mellon  SCAL  project,  we  have  completed  a  first  implementation  of  a  nested 
data-parallel  language  called  Nesl.  This  implementation  is  based  on  an  intermediate  language  called  VcoDE 
and  a  library  of  vector  routines  called  CVL.  The  implementation  runs  on  the  Connection  Machines  CM-2 
and  CM-5,  the  Cray  C90,  and  serial  workstations.  (We  are  currently  working  on  a  version  for  workstation 
clusters.)  In  this  paper  we  describe  the  language  and  its  implementation,  provide  benchmark  numbers, 
and  analyze  the  benchmark  results.  These  results  demonstrate  that  it  is  possible  to  get  both  efficiency  and 
portability  on  a  variety  of  parallel  machines  with  a  nested  data-parallel  language. 

The  three  benchmarks  described  in  this  p^er  are  a  least-squares  line-fitting  algorithm,  a  median-finding 
algorithm,  and  a  sparse-matrix  vector  product.  Figure  1  summarizes  the  benchmark  timings.  For  each 
machine  we  give  direct  comparisons  to  well-written  native  code  compiled  with  full  optimization.  All  the 
Nesl  benchmark  times  given  in  this  paper  use  the  interpreted  version  of  our  intermediate  language  (as 
discussed  in  Section  5,  a  compiled  version  is  likely  to  be  significantly  faster).  The  line-fitting  benchmark 
measures  the  interpretive  overhead  in  our  implementation:  it  contains  no  nested  parallelism  and  therefore  the 
vectorizing  Fortran  77  and  CM  Fortran  compilers  generate  near-optimal  code.  The  median-finding  results 
show  the  benefit  of  NESL’s  dynamic  memory  allocation  and  dynamic  load  balancing  on  the  Connection 
Machines.  Finally,  the  sparse-matrix  benchmark  demonstrates  the  efficiency  of  NESL’s  nested  parallel 
functions  on  the  Cray  C90. 

The  paper  is  organized  as  follows:  Section  2  describes  Nesl  and  illustrates  how  nested  parallelism  can 
be  applied  to  some  simple  algorithms  on  sparse  matrices.  (A  description  of  how  NESL  can  be  used  for 
a  wide  variety  of  algorithms  is  given  elsewhere  [16].)  Section  3  outlines  the  components  of  the  current 
NESL  implementation.  Section  4  describes  our  benchmarks  and  Section  S  discusses  the  running  times  of  the 
bendunarks. 


2  Nesl  and  Nested  Parallelism 

Nesl  is  a  high'level  language  designed  to  allow  simple  and  concise  descriptions  of  nested  data-paraJlel 
programs  [11}.  It  is  strongly  typed  and  applicative  (free  of  side-effects).  Sequences  are  the  primitive 
data  type  and  the  language  provides  a  large  set  of  built-in  sequence  op«^ons  having  efficient  parallel 
implementations.  Nested  parallelism  is  achieved  through  the  ability  to  apply  functions  in  parallel  to  each 
element  of  a  sequence.  Nesl’s  iq)pIy-to-each  form  is  specified  using  a  set-like  notation  similar  to  set-formers 
in  SETL  [52].  For  example,  the  NESL  expression 

(negateta):  a  in  [3, -4, -9, 5]  I  a  <  4} 

is  read  as  “in  parallel,  for  each  a  in  the  sequence  [3 ,  -4 ,  -9 ,  5  ]  such  that  a  is  less  than  4,  negate 
a”.  Tl^  expression  returns  [  -  3 ,  4 ,  9  ] .  Parallelism  is  available  both  in  the  evaluation  of  the  expression 
to  the  left  of  the  colon  ( ; )  and  in  the  subselection  to  the  right  of  the  pipe  ( I ).  This  parallel  subselection 
can  be  implemented  with  packing  techniques  [6].  Nesl  also  supplies  a  set  of  parallel  functions  that  operate 
on  sequences  as  a  whole.  Figure  2  lists  several  of  these  functions;  a  full  list  can  be  found  in  the  Nesl 
manual  [1  i].  These  are  similar  to  operations  found  in  other  data-parailel  languages. 

Nesl  supports  nested  parallelism  by  allowing  sequences  as  elements  of  a  sequence  and  by  permitting 
the  parallel  sequence  functions  to  be  used  in  the  apply-to-each  construcL  For  example,  a  sparse  matrix  can 
be  rq)resented  as  a  sequence  of  rows,  each  of  which  is  a  sequence  of  ( co  lumn-nuinber ,  va  lue )  pairs. 
The  matrix 

2.0  -1.0 

m=  -1.0  2.0  -1.0 

-1.0  2.0 

is  rq)resented  with  this  technique  as 

m  =  [[(0,  2.0),  (1,  -1.0)], 

[(0,  -1.0),  (1,  2.0),  (2,  -1.0)3, 

[(1.  -1.0),  (2,  2.0)]]. 


2 


Operation 

Description 

#a 

Length  of  sequence  a. 

a[i} 

i*  element  of  sequence  a. 

sum(a) 

Sum  of  sequence  a. 

plus  .scan  (a) 

Parallel  prefix  with  addition. 

permute (a, i) 

Permute  elements  of  sequence  a  to  positions  given  in  sequence  i. 

a  ->  i 

Get  values  from  sequence  a  based  on  indices  in  sequence  i. 

a  ++  b 

Append  sequences  a  and  h. 

Figure  2:  Seven  of  Nesl’s  sequence  functions. 

Sum  values  in  each  row: 

{suin({v  :  (i,v)  in  row});  row  in  m)  ; 

Delete  elements  less  than  eps: 

{{(i,v)  in  row  I  v  >=  eps}:  row  in  m} ; 

Appmid  a  column  j  of  all  I’s: 

{ [  ( j  / 1 . 0 )  ]  ++  row  :  row  in  m}  ; 

Permute  the  rows  to  new  positions  p: 
permute  (m,p)  ; 

Figure  3:  Some  operations  on  sparse  matrices.  Note  that  the  last  operation  pemnutes  whole  rows. 

This  rqiresentation  can  be  used  for  matrices  with  arbitrary  patterns  of  non-zero  elements.  Figure  3  shows 
examples  of  some  useful  operations  on  matrices.  In  these  operations  there  is  parallelism  both  within  each 
row  and  across  the  rows.  The  available  parallelism  is  therefore  proportional  to  the  total  number  of  non-zero 
elements,  rather  than  the  number  of  rows  (outer  parallelism)  or  average  row  size  (inner  parallelism).  Graphs 
can  be  represented  analogously,  using  subsequences  to  store  adjacency  lists. 

Nested  parallelism  is  also  very  useful  in  divide-and-conquer  algorithms.  As  an  example,  consider  a 
parallel  quicksort  algorithm  (Figure  4).  The  three  assignments  for  les,  eql  and  grt  select  the  elements 
less  than,  equal  to,  and  greater  than  the  pivot,  respectively.  The  expression 

{qsort(v);v  in  [les,  grt]} 

puts  the  sequences  les  and  grt  together  into  a  nested  sequence  and  applies  qsort  in  parallel  to  the 
two  elements  of  this  sequence.  The  result  is  a  sequence  with  two  sorted  subsequences.  The  concatenation 
function  ++  is  then  used  to  append  the  three  sequences.  In  this  algorithm,  there  is  parallelism  both  within 
the  selection  of  each  of  the  three  intermediate  sequences  and  in  the  nested  parallel  execution  of  the  recursive 
calls.  A  flat  data-parallel  language  would  not  permit  the  recursive  calls  to  be  made  in  parallel. 

We  decided  to  define  a  new  language  instead  of  adding  nested  parallel  constructs  to  an  existing  language 
for  two  main  reasons.  First,  we  wanted  a  small  core  language,  to  allow  us  to  guarantee  that  everything 
that  is  expressed  in  parallel  compiles  into  a  parallel  form.  Second,  we  wanted  a  side-effect-free  language 
because  implementing  (and  defining  consistent  semantics  for)  nested  data-paralielism  when  nested  function 
calls  can  interact  with  each  other  through  side-effect  is  very  difficult 

Although  we  feel  that  it  would  be  possible  to  add  nested  data-parallelism  to  an  imperative  language,  we 
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fvmction  qsortCs)  = 
if  (#s  <  2)  then  s 
else 

let  pivot  =  s [reuid(#s)  ]  ; 

les  =  {e  in  si  e  <  pivot}; 

eql  =  {e  in  si  e  =  pivot}; 

grt  =  {e  in  s I  e  >  pivot} ; 

result  =  {qsort(v):v  in  [les,  grt]} 
in  result [0]  ++  eql  ++  result [1]; 

Figure  4:  A  nested  data-parallet  quicksort  in  Nesl. 

doubt  that  nested  data-parallelism  could  be  added  to  C  or  Fortran  in  such  a  way  that  the  resulting  language 
would  be  both  clean  and  efficient  For  C  it  is  likely  that  one  would  have  to  limit  the  type  of  objects  permitted 
in  the  parallel  data  structure  in  order  to  produce  efficient  code.  In  particular,  allowing  arbitrary  pointers 
in  a  parallel  structure,  while  highly  desirable,  would  also  be  very  hard  to  implement  For  Fortran  77,  the 
static  memory  requirements  would  severely  limit  the  usefulness  of  a  nested  data-parallel  language.  The 
additional  data  management  features  in  Fortran  90  could  reduce  these  limitations,  but  the  many  features  of 
the  language  are  so  delicately  balanced  that  permitting  nested  structures  would  likely  topple  it 

3  System  Overview 

The  lull  implementation  of  Nesl  consists  of  a  Nesl  compiler,  an  intermediate  language  called  VcoDE  [  12], 
an  interpreter  for  Vcode,  and  a  portable  library  of  parallel  routines  called  CvL  [13].  We  also  have  an 
experimental  VcODE  compiler  for  shared-memory  MIMD  machines  [19,  20].  The  roles  of  the  different 
components  are  shown  in  Figure  5.  This  section  gives  an  overview  of  the  each  of  these  components. 

The  Nesl  execution  times  reported  in  this  paper  are  for  interpreted  Vcode.  Use  of  an  interpreted 
intermediate  language  allows  us  to  port  our  system  very  quickly  to  new  machines;  the  only  component  that 
needs  to  be  rewritten  is  a  library  of  vector  routines  called  by  the  interpreter.  There  is  an  efficiency  loss  from 
using  an  interpreter;  this  loss  depends  on  the  particular  machine  and  on  the  problem  size.  One  portion  of  the 
overhead  is  a  fixed  cost  per  executed  VcoDE  instruction  (a  single  VCODE  instruction  operates  on  an  entire 
sequence).  This  constant  overhead  is  amortized  over  the  per-element  cost  of  executing  a  sequence  operation, 
so  the  system  attains  higher  efficiency  if  longer  sequences  are  used.  Our  technique  for  compiling  nested 
parallelism  (flattening)  allows  us  to  achieve  the  efficiency  of  operating  on  single  long  sequences  instead  of 
several  shorter  sequences  in  the  nested  parallel  calls.  These  efficiency  tradeoffs  are  analyzed  quantitatively 
in  Section  5. 

3.1  Nesl  compiler 

The  Nesl  compiler  translates  Nesl  code  into  VcODE.  It  is  written  in  Cormnon  Lisp  [55]  and  is  invoked 
within  an  interactive  environment.  Figure  6  shows  the  phases  of  the  compiler.  This  section  concentrates 
on  the  third  phase,  flattening  nested  parallelism,  and  briefly  discusses  the  fourth  phase,  type  specialization. 
The  type  checker  in  the  second  phase  is  a  Milner  style  type  checker  [42]  with  type  inference,  and  the  other 
phases  use  standard  compiler  techniques. 

Our  tqiproach  to  implementing  nested  parallelism  uses  a  technique  called  flattening  nested  paral¬ 
lelism  [17,  10,  46].  This  process  converts  nested  sequences  into  sets  of  flat  vectors  and  translates  the 
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Figure  5:  The  parts  of  the  SCAL  project  and  how  they  fit  together.  CAL  stands  for  Cray  Assembly 
Language.  C-PARIS  is  the  C  interface  to  the  CM-2  Parallel  Instruction  Set,  and  CMMD  is  the  CM-5 
message-passing  library.  Cluster  Cvl  is  under  development. 


Figure  6:  The  main  tasks  of  the  nesl  compiler. 

nested  operations  into  segmented*  VcoDE  operations  on  the  flattened  representation.  The  flattening  of 
nested  sequences  is  achieved  by  converting  each  sequence  into  a  pair,  a  value  vector  and  a  set  of  segment 
descriptors.  For  example,  the  sequence  [2,  9,  1,  5,  6,  3,  8]  would  be  represented  by  the  pair 

segdes  =  [7] 

value  =  [2,  9,  1,  5,  6,  3,  8] 

and  the  nested  sequence  [  [2,  1],  [7,  0.  3],  [  4  ]]  would  be  represented  as 

segdes 1  =  [3] 

segdes2  =  [2,  3,  1] 

value  =  [2,  1,  7,  0,  3,  4] 

In  these  examples,  a  segdes  with  only  one  value  specifies  that  the  whole  vector  is  one  segment  (the  use 
of  this  seemingly  redundant  data  is  critical  when  implementing  nested  versions  of  user-defined  functions). 


'S«giiieniB  ate  a  technique  for  implementing  nested  parallelism  and  are  fully  described  by  Blelloch  [8, 9,  10]. 
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In  the  second  example,  segdesl  describes  the  segmentation  of  segdes2,  not  of  value.  Sequences  that 
are  nested  deeper  are  represented  by  additional  segment  descriptors.  Sequences  of  fixed-sized  structures 
(such  as  pairs)  are  represented  by' multiple  value  vectors  (one  for  each  slot  in  the  structure)  that  share  a 
common  segment  descriptor. 

Using  this  r^resentation,  nested  versions  of  NESL  operations  with  VCODE  counterparts  can  be  directly 
converted  into  the  corresponding  segmented  VcODE  instruction.  For  example,  VcODE  includes  a  segmented 
i-- reduce  operation,  which  if  passed  the  two  vectors  shown  above,  would  return  the  vector  [3 ,  10 , 
4  ] .  Nested  versions  of  user-defined  functions  are  implemented  by  using  program  transformations  called 
stepping-up  and  stepping-down  [17].  These  transformations  convert  all  the  substeps  of  a  nested  call  into 
segmented  operations  or  into  calls  to  other  functions  that  have  already  been  transformed.  With  these 
transformations,  when  a  user  function  is  used  in  an  apply-to-each  form,  the  data  for  each  subcall  is  in  a 
separate  segment,  and  computations  on  each  segment  can  proceed  independently. 

The  most  complex  part  of  flattening  nested  parallelism  is  transforming  conditional  statements.  There  are 
two  main  parts  of  this  transformation.  The  first  part  inserts  code  for  packing  out  all  the  data  in  subcalls  that 
do  not  take  a  branch  and  for  reinserting  this  data  when  the  branch  is  complete.  This  guarantees  that  work  is 
only  done  on  the  subcalls  that  take  the  branch,  and  is  similar  to  techniques  used  by  vectorizing  compilers  to 
vectorize  loops  with  conditionals  [64,  §3.6].  It  also  results  in  proper  load  balancing  on  parallel  machines. 
The  second  part  inserts  code  to  test  if  any  of  the  subcaiis  are  taking  the  branch  and  to  skip  the  branch  if  none 
are.  This  is  important  for  guaranteeing  termination.  The  communication  costs  involved  in  doing  the  packing 
and  unpacking  can  be  quite  high,  and  one  area  of  our  ongoing  research  is  to  see  how  the  communication  can 
be  reduced  by  only  packing  when  there  is  a  significant  load  imbalance  among  processors  [53]. 

We  now  briefly  describe  the  third  phase  of  the  NESL  compiler,  which  involves  specializing  polymorphic 
code.  The  standard  implementation  of  polymorphism  is  to  have  a  single  compiled  version  of  a  function  and 
achieve  polymorphism  by  use  of  pointers — the  function  only  manipulates  pointers  to  the  data  rather  than  the 
actual  data  and  therefore  doesn’t  care  about  the  type  or  size  of  the  data.  In  this  scheme  there  is  almost  no 
control  of  layout  of  the  data  since  such  control  would  require  knowledge  about  the  size  of  the  data,  which 
dq)ends  on  its  type.  Because  of  the  high  cost  of  communication  among  processors  on  parallel  computers, 
it  is  important  that  we  have  control  over  the  layout  of  data.  The  Nesl  compiler  therefore  specializes  all 
polymorphic  functions  to  specific  types  and  generates  code  for  each  type.  The  type  system  is  such  that  the 
compiler  can  determine  all  types  to  which  a  particular  function  is  going  to  be  applied  at  compile  time.  The 
compiler  must,  however,  have  access  to  the  whole  program. 

3,2  VCODE 

VCODE  [12, 14]  was  designed  as  a  testbed  for  a  systematic  study  of  compiler  and  implementation  issues  that 
arise  in  data-parallel  languages.  Accordingly,  its  design  concentrates  on  data  parallelism  to  the  exclusion 
of  other  issues  more  commonly  seen  in  language  designs,  such  as  data  structures  and  advanced  control 
constructs. 

The  aggregate  data  type  in  Vcode  is  the  homogeneous  segmented  vector  of  atomic  types.  There 
are  four  basic  data  types:  boolean  vectors,  integer  vectors,  floating-point  vectors,  and  segment  descriptors. 
Segment  descriptors  describe  how  vectors  are  partitioned  into  segments.  For  example,  the  segment  descriptor 
[2,  3 ,  1  ]  specifies  that  a  vector  of  length  6  should  be  considered  as  3  segments  of  lengths  2,  3  and  1 , 
respectively  (segments  are  contiguous  and  non-overlapping).  Each  instruction  then  operates  independently 
over  each  segment  of  the  vector.  For  example,  the  +_scan  instruction  performs  a  parallel  prefix  operation 
on  each  segmenL  starting  from  zero  on  each  segment  boundary.  The  segmented  versions  of  the  instructions 
are  critical  for  the  implementation  of  nested  parallelism  (see  Section  3.1).  The  notion  of  segments  also 
appears  in  some  of  the  library  routines  of  the  Connection  Machines  CM-2  [57]  and  CM-5  [60]  and  has  been 
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adc^Med  ia  the  mEFDC  and  SUFFIX  operations  of  High  Performance  Fortran  [33].  The  internal  VcoDE 
repwatentarion  of  the  segment  descriptor  is  machine-dependent:  our  serial  implementation  uses  a  sequence 
of  die  lengths  of  each  segment,  while  our  implementations  on  the  Cray  and  the  Connection  Machines  CM-2 
and  CM-5  also  use  flags  to  mark  boundaries  between  segments. 

VCODB  can  dynamically  create  and  destroy  arbitrarily  long  vectors,  as  this  is  necessary  for  most  data- 
parallel  languages.  Unlike  machine-oriented  languages,  Vcode  provides  no  notion  of  processors,  either 
{fliysical  or  virtual.  This  allows  the  interpreter  or  compilo^  to  choose  strategies  for  process  mapping  and 
memory  management  appropriate  for  the  target  machine. 

VC(H>B  is  based  on  a  stack  model  of  memory  and  resembles  Pcode  [44],  a  serial  intermediate  language 
designed  for  Pascal.  It  is  conveniently  defined  in  terms  of  an  abstract  machine,  the  vector  stack  machine. 
This  is  a  standard  stack  machine,  except  that  each  stack  location  contains  an  arbitrarily  long  vector  of  atomic 
values.  Each  vector  has  a  length  associated  with  it,  and  vectors  in  different  positions  on  the  stack  may  have 
diffemt  lengths.  Instructions  operate  on  entire  vectors  at  a  time.  For  example,  the  +  instruction  pops  the 
top  two  vectors  (of  equal  length  and  appropriate  type)  from  the  stack,  adds  corresponding  elements  together, 
and  returns  the  result  vector  to  the  top  of  the  stack. 

VcoDB  contains  a  small  set  of  instructions,  most  of  which  operate  on  vectors  of  atomic  values.  The 
vector  instructions  operate  on  a  fixed  number  of  vectors  from  the  top  of  the  stack  and  return  their  result 
to  the  top  of  the  stack.  The  vector  instructions  can  be  divided  into  the  following  subclasses:  elementwise 
instructions,  which  are  apply-to-each  extensions  of  arithmetic  and  logical  operations;  scan  and  reduction 
instructions,  which  provide  reduction  and  parallel  prefix  operations  on  vectors  for  a  fixed  set  of  associative 
operators;  permute  instructions,  which  rearrange  elements  of  a  vector  according  to  another  vector  of  indices, 
possibly  accompanied  by  masking  or  default  vectors;  segment  descriptor  manipulation  instructions,  which 
create  and  manipulate  segment  descriptors;  and  I/O  instructions,  which  read  and  write  vectors  and  segment 
descriptors.  The  permute  instructions  can  be  used  to  implement  vector-based  indexing  in  C*.  Fortran  90, 
and  APL,  for  the operation  in  CM-Lisp  [63],  and  for  the  move  operation  in  Paralation  Lisp  [SI].  The 
segmented  instructions  permit  the  implementation  of  the  nested  parallelism  allowed  in  CM-Lisp,  SETL, 
and  Paralation  Lisp.  The  rest  of  the  Vcode  instructions  are  for  flow  control,  stack  management  and  system 
interface.  Of  special  note  are  the  stack  management  functions  POP  and  COPY,  each  of  which  takes  two 
arguments.  These  specify  a  distance  up  the  stack  to  start  and  a  number  of  elements  on  which  to  operate. 

A  Vcode  program  is  a  set  of  Vcode  functions.  Vcode  functions  are  maps  from  stack  prefixes  to  stack 
prefixes;  the  stack  is  used  to  pass  parameters  and  return  results.  Control  flow  within  a  function  is  directed 
by  a  conditional  form  and  recursion.  The  conditional  form  has  the  restriction  that  both  branches  must  leave 
the  stack  in  the  same  state  with  respect  to  depth  and  data  types.  This  is  necessary  to  ensure  that  functions 
are  well-formed.  All  instructions  are  strict,  i.e.,  all  inputs  must  be  evaluated  before  an  instruction  can  be 
executed.  Synchronization  is  implicit  and  lock-step:  an  instruction  must  be  completed  before  the  next 
instruction  can  begin  execution.  Vcode  is  side-effect  free,  so  some  permutation  instructions  may  require 
the  destination  to  be  copied  before  any  data  movement  occurs. 

33  Vcode  interpreter 

The  two  main  requirements  for  the  VcoDE  interpreter  are  portability  and  efficient  management  of  vector 
memory.  The  interpreter  is  written  in  ANSI  C  to  ease  portability  and  contains  little  machine-dependent  code 
(most  of  which  stems  from  operating  system  differences).  The  typical  sequence  of  operations  performed  by 
the  interpreter  to  execute  a  vector  operation  is  as  follows:  a  table  lookup  is  performed  to  find  the  number 
of  stack  arguments  an  operation  uses  and  the  length  and  type  of  the  operation’s  result;  optionally,  a  check  is 
made  that  any  length  and  type  constraints  on  the  arguments  are  satisfied;  memory  for  the  result  is  allocated, 
as  explained  below;  the  associated  CVL  function  is  called  to  perform  the  vector  operation;  after  completion. 
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Figure  7:  Internal  structure  of  the  Vcode  interpreter.  The  dark  entries  in  the  vector  block  list  refer 
to  active  regions  of  vector  memory.  The  light  ones  correspond  to  free  regions.  Entries  to  free 
regions  of  similar  size  are  linked  together  in  the  same  free  list  stnjcture.  There  are  32  free  lists, 
each  one  referring  to  regions  at  most  twice  the  size  of  the  previous. 

the  arguments  are  popped  off  the  stack.  All  this  results  in  a  constant  time  interpreti  ve  overhead  per  executed 
Vcode  instruction. 

The  most  novel  aspect  of  the  interpreter  is  its  memory  management  Vcode  creates  and  destroys  vectors 
of  differing  and  dynamically  determined  sizes.  Most  memory  management  and  garbage  collection  literature 
(see  [3],  for  example)  assumes  large  numbers  of  small  objects  and  a  large  (virtual)  memory.  Vcode  programs 
do  not  behave  in  this  manner.  In  particular,  there  are  usually  few  large  data  structures  (vectors)  active  at  any 
one  time,  and  we  want  to  operate  on  the  largest  possible  problem  size  that  available  memory  can  support. 
The  supercomputers  on  which  VcoDE  runs  typically  provide  no  virtual  memory  facilities,  so  there  is  a  hard 
limit  on  the  amount  of  memory  a  program  may  use.  Therefore  we  must  be  very  thrifty  with  the  amount  of 
memory  used  by  a  VcoDE  program. 

Figure  7  shows  the  internal  data  structures  for  memory  management  Each  entry  in  the  vector  stack  is 
really  a  pointer  to  an  entry  in  a  vector  block  list.  VcODE  stack  modification  operations  are  implemented  via 
pointer  manipulations;  no  action  is  taken  on  vector  memory. 

Entries  in  the  vector  block  list  describe  how  vector  memory  is  currently  partitioned.  The  vector  block  list 
maintains  an  order-preserving  1-1  map  to  regions  of  vector  memory  (in  terms  of  Figure  7,  the  arrows  from 
the  list  to  vector  memory  never  cross).  Each  region  of  vector  memory  is  either  active  or  free,  as  indicated 
by  a  flag  in  the  corresponding  vector  block  entry.  Each  vector  block  entry  has  a  reference  count,  giving  the 
number  of  times  it  occurs  on  the  stack.  When  this  number  reaches  zero,  the  vector  stored  in  the  region  can 
no  longer  be  referenced.  The  interpreter  frees  the ;  ociated  region  of  memory,  merges  it  with  any  adjacent 
firee  regions  (destroying  the  vector  block  entry  of  me  merged  region)  and  puts  it  on  a  free-list.  There  are 
several  free-lists,  each  corresponding  to  a  different  range  of  region  sizes  (currently,  each  list  has  regions  a 
factor  of  two  larger  than  the  previous  list).  Use  of  multiple  free  lists  for  different  sized  regions  allows  us  to 
examine  fewer  regions  to  And  one  of  the  appropriate  size. 

To  allocate  space  for  a  new  vector  of  known  memory  size,  the  interpreter  tries  to  find  a  large  enough  free 
region  of  vector  memory  by  doing  a  first-fit  on  the  appropriate  free-list.  If  no  block  in  that  free-list  is  big 
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enough,  the  free-iists  of  larger  region  sizes  ate  checked  in  a  hrst-ht  manner,  and  if  none  is  found,  the  garbage 
compacting  routine  is  called.  This  routine  pushes  all  vectors  as  far  as  possible  to  the  front  of  vector  memory, 
creating  one  large  region  of  free  memory  at  the  end.  If  this  region  is  not  large  enough,  the  interpreter  signals 
an  out-of-memory  error.  When  a  large  enough  region  has  been  found,  the  interpreter  divides  it  into  tvy^o 
pieces:  one  for  the  vector  and  one  that  is  returned  to  the  free-list  appropriate  for  that  region’s  size. 

The  reference  counts  in  the  vector  block  list  also  enable  the  interpreter  to  perform  simple  copy  elimination 
optimizations  [28,  31].  To  enforce  the  applicative  semantics  of  VCODE  the  implementation  of  an  operation 
dutt  dianges  only  a  single  element  of  a  vector  must  copy  the  entire  vector  before  making  any  alteration.  If 
the  reference  count  indicates  that  this  is  the  last  reference  to  the  “old”  version  of  the  vector,  we  may  avoid 
the  copy  and  implement  the  operation  in  a  (more  (.fficient)  destructive  manner. 

3.4  VCX)DE  compiler 

Chatterjee’s  doctoral  dissertation  [19,  20]  discusses  the  design,  implementation,  and  evaluation  of  an  opti¬ 
mizing  VcoDE  compiler  for  shared-memory  MIMD  machines.  In  this  section,  we  give  a  sununary  of  these 
issues. 

There  is.  of  course,  a  trivial  implementation  of  VCODE  on  a  MIMD  machine:  each  VcoDE  instruction 
is  written  as  a  parallel  loop,  and  the  processors  synchronize  between  instructions.  This  is  precisely  how 
the  VcoDB  interpreter  works.  However,  such  an  implementation  of  data  parallelism  on  a  MIMD  machine 
has  several  performance  bottlenecks,  which  limit  both  the  asymptotic  performance  and  the  performance  for 
small  problem  sizes. 

•  The  parallelism  of  the  source  language  is  too  fine-grained  for  MIMD  macliines.  Since  loop  overiiead 
scales  with  problem  size,  it  limits  the  asymptotic  performance  of  the  parallel  program. 

•  Serial  overhead  related  to  load  balancing,  storage  management  and  loop  setup  is  independent  of  the 
problem  size,  and  therefore  influences  the  small  problem  size  performance  but  not  the  asymptotic 
performance. 

•  The  implicit  lock-step  synclironization  of  the  data-parallel  model  is  expensive  to  implement  on  MIMD 
machines,  whereas  it  comes  for  free  on  SIMD  machines.  Since  the  cost  of  barrier  synchronization  on 
a  MIMD  machine  depends  on  the  number  of  processors  but  is  independent  of  problem  size,  this  only 
affects  the  performance  at  small  problem  sizes. 

•  The  (potentially  large)  intermediate  results  generated  by  the  fine-grained  parallelism  cause  problems 
for  the  memory  organizations  typically  found  in  MIMD  machines.  In  particular,  locality  of  reference 
and  its  concomitant  benefits  are  compromised.  Loss  of  locality  increases  data  access  times:  it  scales 
with  the  problem  size,  and  limits  asymptotic  performance. 

The  VCODE  compiler  addresses  these  problems  through  a  set  of  program  optimization  techniques  that 
cluster  computation  into  larger  loop  bodies,  reduce  intermediate  storage,  and  relax  synchronization  con¬ 
straints.  These  techniques  do  not  require  knowledge  of  data  sizes  or  the  number  of  processors  at  compile 
time.  There  are  four  principal  optimization  techniques: 

•  Size  inference  symbolically  identifies  equivalence  classes  of  data  sizes  ar»d  loop  structures. 

•  Clustering  uses  the  information  produced  by  size  inference  and  uses  it  to  partition  the  computation 
graph  into  a  set  of  parallel  loops  that  can  be  scheduled. 

•  Access  inference  refines  the  loops  produced  by  clustering  based  on  conflicts  between  definition  and 
usage  access  patterns  of  data  vectors,  producing  loops  free  of  any  explicit  synchronization. 
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•  Storage  optimizations  tetaove  may  storage  or  reduce  the  storage  firom  a  full  array  to  a  smaller  buffer 
sttxing  only  a  few  elements. 

The  compiler  accepts  Vcodb  as  input  and  outputs  C  code  with  thread  extensions  targeted  at  shared- 
memory  multiprocessors.  The  native  C  compiler  is  dien  invoked  to  produce  the  final  machine  code.  This 
makes  the  compiler  portable  and  allows  the  use  of  the  best  compiler  technology  available  for  the  machine. 

3^  CVL 

To  enable  rapid  porting  of  VCODE  to  new  machines,  we  designed  CvL  (C  Vector  Library)  [13],  a  library  of 
low-levei  segmented  vector  routines  callable  from  C.  These  are  used  by  a  VcoDE  interpreter,  described  in 
Section  33. 

The  purpose  of  CVL  is  to  provide  a  portable  segmented  vector  abstraction  that  can  be  efficiently 
implemented  on  a  wide  range  of  machines.  This  library  can  then  be  used  by  interpreters  and  compilers  of 
more  complicated  languages.  Examples  of  the  vector  routines  supplied  by  the  library  include;  numerous 
elementwise  operations  (arithmetical  and  logical),  scan  and  reduction  operations,  and  a  variety  of  permutation 
operations.  All  primitives  are  defined  for  both  unsegmented  and  segmented  vectors,*  and  for  several  data 
types:  integers,  booleans,  and  double  precision  floating  point  numbers.  Although  CvL  implementations  are 
machine-specific,  the  interface  to  the  library  is  machine-independent  Thus  C  code  written  using  CvL  is 
portable  across  all  machines  with  a  C  compiler. 

The  vector  memory  model  supplied  by  CVL  was  designed  to  be  portable  to  both  shared  and  distributed 
memory  machines.  CVL  stores  vectors  in  a  region  of  memory  (vector  memory)  that  should  be  viewed  as 
sq)arate  from  normal  memory.  CvL  exports  only  a  handle  to  a  vector,  known  as  a  vec-p.  Non-library 
functions  may  not  do  anything  with  a  vec^)  except  store  it  and  pass  it  as  a  parameter  to  CVL  routines. 
Given  a  vec^j,  there  are  CVL  functions  that  return  the  amount  of  vector  memory  that  the  corresponding 
vector  takes.  CVL  itself  does  no  memory  management,  other  than  the  initial  allocation  of  the  totality  of  the 
vector  memory.  It  is  the  responsibility  of  the  code  using  the  library  to  manage  this  memory.  One  way  of 
doing  this  is  described  in  Section  3.3. 

The  key  to  efficient  implementation  of  nested  data-parallelism  is  the  efficient  implementation  of  opera¬ 
tions  on  segmented  vectors,  which  in  our  system  corresponds  to  efficient  code  for  the  various  CVL  functions. 
Unfortunately  current  supercomputer  compilers  cannot  adequately  compile  Fortran  or  C  implementations 
of  these  operations,  forcing  us  to  implement  these  operations  at  a  very  low-level  for  the  target  machines. 

33.1  Cray  CvL 

We  have  implemented  Cray  Cvl  for  a  single  processor  of  the  Cray  Y-MP  and  Y-MP  C90.  The  CVL  library 
consists  of  Cray  Assembly  Language  [24]  kernels  which  are  called  from  C. 

The  dioice  of  Cray  Assembly  Language  (CAL)  might  seem  unnecessarily  low-level,  especially  for 
implementing  a  vector  language  on  a  vector  machine.  However,  while  the  elementwise  operations  could  be 
implemented  efficiently  in  C  or  Fortran,  it  is  very  difficult  to  produce  efficient  code  for  the  scan  operations 
and  segmented  operations  from  C  or  Fortran.  There  are  several  reasons  why  hand-written  assembly  code 
outperforms  C  or  Fortran  code.  For  segmented  operations,  the  segmentation  of  a  vector  is  represented  using 
packed  booleans,  a  representation  that  is  not  supported  by  high-level  languages.  Storing  64  booleans  in  a 
single  word  reduc^  memory  requirements  and  makes  efficient  use  of  the  Cray’s  functional  units.  The  scan 
operations  also  need  to  use  a  non-standard  loop  structure  we  call  loop  raking  [  15]  in  order  to  be  vectorized. 
Finally,  in  order  to  make  the  best  use  of  hardware  chaining  and  avoid  register  reservation  effects  [48],  it  is 
necessary  to  unroll  the  loops.  The  scan  and  segmented  scan  algorithms  are  described  in  detail  elsewhere  [21]. 
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CAL  code  is  very  efficient,  but  programming  in  raw  CAL  is  tedious,  error-prone,  and  non-portable. 
In  (xder  to  program  at  a  higher-level,  we  implemented  the  Cvl  kernels  using  our  own  nciacro  assembler 
(written  in  Common  Lisp)  that  generates  CAL.  In  the  assembler,  we  developed  a  set  of  loop  macros  for 
inq>lementing  scans,  including  a  version  that  generates  unrolled  loops.  Part  of  the  motivation  for  developing 
die  macro  assembler  was  to  allow  us  to  port  the  code  easily  to  other  vector  machines  in  the  future. 

Although  the  current  version  of  Cray  Cvl  only  uses  one  processor,  we  have  implemented  prototypes  of 
multiprocessor  vector  operations  [15]  and  we  are  working  on  a  complete  multiprocessor  library  using  the 
Cray  miootasking  facilities  [25]. 

33 J  CM-2CVL 

The  implementadon  of  Cvl  for  the  CM-2  is  built  on  top  of  the  Paris  instruction  set  [57].  As  a  machine- 
^lecific  language,  Paris  has  several  nice  features  for  implementing  CvL;  a  clean  interface  to  C,  a  model  of  the 
CM-2  as  a  collection  of  virtual  processors,  and  direct  support  for  scan,  reduce,  and  other  vector  operations. 
It  also  has  three  significant  drawbacks,  however.  Fu^L  it  uses  the  older  “heldwise”  representation  of  data, 
which  involves  extra  transpose  operations  when  using  the  floating  point  chips.  The  newer  and  more  efficient 
“slicewise”  representadon  is  not  supported  by  Paris.  Second.  Paris  vectors  (the  basic  Paris  data  type)  are 
limited  to  64K  bits  in  size,  and  the  current  CVL  implementation  inherits  this  limitation.  Third,  assigning  the 
responsibility  of  memory  management  to  VCODE  means  that  CVL  vectors  are  really  pointers  into  a  single 
big  block  of  memory.  Since  Paris  communication  functions  cannot  operate  on  these  pointers  (implemented 
as  Paris  “aliases”  with  non-zero  offsets),  Cvl  arguments  must  be  copied  into  temporary  vectors  before  use. 

This  problem  and  its  inelegant  solution  illustrate  the  difficulty  of  finding  an  intermediate  language  model 
that  can  be  efficiently  implemented  across  all  supercomputer  architectures.  In  this  case  (CM-2  Paris),  it 
would  be  more  efficient  if  the  job  of  memory  allocation  were  moved  from  VCOOE  to  Cvl.  This  would  let 
the  Cvl  code  allocate  new  Paris  vectors  on  demand,  thereby  eliminating  the  overhead  of  double  copying. 
Alternatively,  CM-2  Cvl  could  be  rewritten  at  a  much  lower  level  in  PEAK  [50]  code,  which  would  bypass 
alt  of  the  Paris-based  limitations,  but  at  a  high  cost  in  terms  of  programmer  effort  and  maintainability. 

33  J  CM-SCvl 

For  the  CM-S  implementation  of  Cvl,  we  originally  thought  that  CM  Fortran  [59]  would  be  the  language  of 
choice.  This  has  the  attraction  of  being  available  on  the  CM-2  as  well,  and  of  generating  slicewise  PEAK 
code  for  that  machine.  Thus  an  implementation  of  CVL  written  in  CM  Fortran  for  the  CM-5  should  be  easily 
portable  back  to  the  CM-2  and  holds  the  promise  of  offering  improved  performance  over  the  existing  Paris- 
based  CM-2  CVL.  Furthermore,  CM  Fortran  code  can  use  the  CM-5  vector  units,  which  greatly  increase 
the  speed  of  simple  elementwise  operations.  However,  CM  Fortran  does  not  provide  low-level  access  to 
the  underlying  machine.  This  caused  several  major  problems.  There  was  no  way  to  force  the  compiler  to 
layout  Cvl  memory  efficiently  (since  the  compiler  could  not  know  ahead  of  time  hr  m  the  memory  would 
be  subdivided  by  VcODE).  Similarly,  there  was  no  way  to  guarantee  maximal  alignment  of  Cvl  vectors, 
which  the  VcODE  interpreter  relies  upon  to  be  able  to.  for  example,  place  a  vector  of  doubles  where  a  vector 
of  booleans  used  to  start. 

We  therefore  wrote  CM-5  Cvl  in  C,  using  the  CMMD  [60]  message-passing  library.  This  combination 
gives  fuli  control  over  memory  layout  and  alignment,  allowing  vectors  to  be  load-balanced  across  all 
processors.  The  Cvl  scan  and  reduction  functions  are  implemented  using  the  corresponding  CMMD 
primitives,  while  functions  involving  arbitrary  communication  (such  as  permute  and  pack)  use  the  CMAML 
Active  Message  functions  [62].  Our  current  implementation  is  an  initial  release  that  does  not  fully  reflect 
the  power  of  the  machine.  Future  improvements  are  expected  to  include: 
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•  Using  CMNA  [61]  to  access  the  network  interface  chips  directly.  With  specialized  receive  functions, 
this  will  allow  the  full  payload  of  each  packet  to  be  used,  rather  than  sacrificing  one  word  to  the  active 
message  system.  Direct  access  to  the  on-chip  packet  counters  will  also  reduce  the  time  taken  to  agree 
on  whether  all  packets  have  been  sent  and  received. 

•  Using  CDI^AC  [58]  to  exploit  the  vector  units.  This  will  greatly  increase  the  speed  of  elementwise 
operations. 

3,5.4  Other  Cvl  implementations 

Our  implementation  of  Cvl  on  workstations  uses  ANSI  C  and  can  therefore  be  ported  easily  to  nmst  serial 
machines.  Cluster  CVL  is  being  written  using  C  and  PVM  [30].  Faith  et  aL  [27]  at  the  University  of  North 
Carolina  have  ported  Cvl  to  the  Maspar  MP-2  using  the  MPL  programming  language. 

4  Benchmarks 

This  section  describes  three  benchmarks:  a  least-squares  line-fiL  a  generalized  median  find,  and  a  sparse- 
matrix  vector  product  The  particular  benchmarks  were  selected  for  their  diverse  computational  require¬ 
ments,  summarized  in  Figure  8.  They  are  each  simple  enough  that  the  reader  should  be  able  to  fully 
understand  what  the  algorithm  is  doing,  but  are  more  complex  than  bare  kernels  such  as  the  Livermore 
Loops  [41].  The  performance  of  these  benchmarks  demonstrates  many  of  the  advanuges  and  disadvantages 
of  our  system.  The  results  of  these  benchmarks  will  be  analyzed  in  Section  5. 


Communication 

Dynamic  Structures 

Nested  Parallelism 

Line  Fit 

Low 

No 

No 

Median 

Yes 

No 

Sparse  MxV 

No 

Yes 

Figure  8;  The  properties  of  the  benchmarks. 

Our  first  benchmark  is  aleast-squaresline-httingroutine  using  the  algorithm  described  in  Press  etal.  [4S, 
§14.2].  The  version  we  use  assumes  all  points  have  equal  measurement  errors.  This  is  a  straightforward 
algorithm  that  requires  very  little  communication  and  has  no  nested  parallelism.  Furthermore,  all  vectors  are 
of  known  size  at  function  invocation  and  can  be  allocated  stoically.  It  is  therefore  well-suited  for  languages 
such  as  Fortran  90.  We  use  this  benchmark  to  measure  the  overhead  incurred  by  our  interpreter-based 
implementation.  The  Nesl  code  for  the  benchmark  is  given  in  Figure  9. 

Our  second  benchmark  is  a  median-finding  algorithm.  To  implement  median-finding  we  use  a  more 
general  algorithm  that  finds  the  k***  smallest  element  in  a  set  The  algorithm  is  based  on  the  quickseiect 
method  [34].  This  method  is  similar  to  quicksort,  but  calls  itself  recursively  only  on  the  partition  containing 
the  result  (the  recursion  was  removed  in  the  Fortran  version).  This  algorithm  requires  dynamic  memory 
allocation  (also  removed  in  the  Fortran  version)  since  the  sizes  of  the  less-than-pivot  and  greater-than-pivot 
sets  are  data  dependent  In  order  to  obtain  proper  load  balancing,  the  data  must  be  redistributed  on  each 
iteration.  The  NESL  code  for  the  algorithm  is  shown  in  Figure  10.  This  algorithm  was  selected  to  demonstrate 
the  utility  and  efficiency  of  Nesl’s  dynamic  allocation. 

Our  third  benchmark  multiplies  a  sparse  matrix  by  a  dense  vector  and  demonstrates  the  power  and 
efficiency  of  nested  parallelism.  Sparse-matrix  vector  multiplication  is  an  important  supercomputer  kernel 
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fxuiction  linef  it  (x,  y)  = 
let 

n  =  float (#x); 

xa  suin(x)/n; 
ya  =  sum(y)/n; 

Stt  =  sum({(x  -  xa)“2:  x}); 

b  =  suin{{(x  -  xa)*y:  x;  y})/Stt; 

a  =  ya  -  xa*b; 

chi2  =  suin({  (y-a-b*x)  ^2 :  x;  y})  ; 
siga  =  sqrt((1.0/n  +  xa^2/Stt) *chi2/n) ; 
sigb  =  sqrt ( (1 . 0/Stt ) *chi2/n) 
in 

(a,  b,  siga,  sigb) ; 

Hgure  9:  Nesl  code  for  fitting  a  line  using  a  least-square  fit.  The  function  takes  sequences  of 
X  and  y  coordinates  and  returns  the  intercept  (a)  and  slope  (b)  and  their  respective  probable 
uncertainties  (siga  and  sigb). 

that  is  difficult  to  vectorize  and  parallelize  efficiently  because  of  its  irregular  data  structures  and  high 
communication  requirements.  While  there  are  many  algorithms  for  special  classes  of  sparse  matrices,  we 
are  interested  in  supporting  operations  for  arbitrary  sparse  matrices.  This  is  challenging  since  the  matrices 
used  in  a  number  of  different  scientific  and  engineering  disciplines  often  have  average  row  lengths  of  less 
than  10.  These  row  lengths  are  significantly  less  than  the  start-up  overiiead  for  vector  machines  (ni/2) 
and  are  far  too  small  to  divide  among  processors  in  an  attempt  to  parallelize  row  by  row.  On  the  other 
hand,  dividing  rows  among  processors  makes  load  balancing  difficult  since  each  row  can  have  a  different 
length  and  the  longest  rows  could  be  very  much  longer  than  the  shortest  Our  implementations  (in  NESL, 
C,  and  Fortran)  use  a  compressed  row  format  containing  the  number  of  non-zero  elements  in  each  row,  and 
the  values  of  each  non-zero  matrix  element  along  with  its  column  index  [26].  Figure  1 1  shows  the  Nesl 
implementation  and  Figure  12  shows  an  example  of  a  sparse-matrix  vector  product  using  this  format. 

5  Results 

Running  times  for  all  benchmarks  with  a  variety  of  data  sizes  are  given  in  Table  I.  Times  are  given  for 
both  interpreted  NESL  code  and  for  native  code.  For  native  code  we  used  Fortran  77  on  the  Cray  C90, 
CM  Fortran  [59]  on  the  Connection  Machines  CM-2  and  CM-5,  and  C  on  the  DEC  Alpha  workstation.  In 
all  cases  we  used  full  optimization.  The  native  code  we  used  is  given  in  Appendix  A.  Nesl  timings  are  for 
the  code  shown  in  Section  4,  run  using  the  VCODE  interpreter.  All  Alpha  benchmarks  were  run  on  a  DEC 
3000  AXP  Model  400  with  32  Mbytes  of  memory.  All  Cray  C90  benchmarks  were  run  on  one  processor 
of  a  C90/16  with  256  Mwords  of  memory.  Alt  Connection  Machine  CM-2  benchmarks  were  run  on  32K 
processors  of  a  CM-2,  with  32  Kbytes  of  memory  per  processor.  All  Connection  Machine  CM-5  benchmarks 
were  run  on  256  processors  of  a  CM-5,  with  32  Mbytes  of  memory  per  processor. 

The  CM-5  CM  Fortran  benchmarks  did  not  use  the  vector  units,  in  order  to  enable  a  better  comparison 
to  be  made  with  the  current  implementation  of  CVL  on  that  machine.  When  the  vector  units  are  used,  the 
CM  Fortran  line-fit  benchmark  runs  1-40  times  faster  (depending  on  problem  size),  the  CM  Fortran  median 
benchmark  runs  2-5  times  faster,  and  the  CM  Fortran  sparse-matrix  vector  product  benchmark  runs  1-1.5 

n 


function  select Jcth(s,  k)  = 
let  pivot  =  s[#s/2]; 

les  =  {e  in  s  1  e  <  pivot} 
in 

if  (k  <  #les)  then 
select -kth( les,  k) 
else 

let  grt  =  {e  in  s  I  e  >  pivot} 
in  if  (k  >=  #s  -  #grt)  then 

select-kth(grt,  k  -  {#s  -  #grt)) 
else  pivot; 

function  median(s)  =  select-kth(s,  #s/2); 


Figure  10:  Nesl  code  for  median  finding.  The  function  selectJcth  returns  the  kth  smallest 
element  of  s.  This  is  used  by  median  to  find  the  middle  element. 

function  MxV(Mval,  Midx,  Mien,  Vect)  = 
let  V  =  Vect  ->  Midx; 

p  =  {a  *  b:  a  in  Mval;  b  in  v} 
in 

{sum(row)  :  row  in  nest  (p.  Mien)}; 


Figure  11:  NESL  code  for  sparse-matrix  vector  product.  Mval  holds  the  matnx'vatues,  Midx  holds 
the  column  indices.  Mien  holds  the  length  of  each  row,  and  Vect  is  the  input  vector.  The  function 
nest  takes  the  flat  senuence  p  and  nests  it  using  the  lengths  in  Mien  (the  sum  of  the  values  in 
Mien  must  equal  the  length  of  p). 
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Figure  12:  An  example  of  sparse-matrix  vector  product. 
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Table  1:  Running  times  (in  seconds)  of  the  benchmarks  for  Nesl  and  native  code.  The  sparse* 
matrix  vector  product  uses  a  row  length  of  5  and  randomly-selected  column  indices.  CM-5  Nesl 
results  are  preliminary. 

times  faster.  It  is  expected  that  a  future  version  of  CM-5  CVL  will  exploit  the  vector  units  and  be  able  to 
achieve  similar  speedups. 

We  now  discuss  three  main  issues  exhibited  by  the  timings;  the  advantage  of  nested  parallelism  in  the 
implementation  of  the  sparse-matrix  vector  product,  the  overhead  incurred  by  our  interpreter,  and  the  need 
for  dynamic  load-balancing  in  the  median-finding  code  on  the  Connection  Machine  CM-2. 


Nested  Parallelism:  The  sparse-matrix  vector  product  benchmark  demonstrates  the  advantages  and  efh- 
ciency  of  nested  data  parallelism.  Figure  13  gives  running  times  on  the  Cray  for  a  variety  of  degrees  of 
sparsity.  For  very  sparse  matrices,  the  Nesl  version  outperforms  the  native  version  by  over  a  factor  of  ten. 
This  performance  gain  arises  because  the  compilation  of  nested  data  parallelism  described  in  Section  3.1 
generates  code  with  running  time  essentially  independent  of  the  size  of  the  sub-sequences.  The  nested 
code  achieves  full  efficiency  (vectorization  on  the  Cray  and  high  data-to-processor  ratio  on  the  CM-2  and 
CM-5)  by  executing  on  the  full  input  data.  The  result  is  consistently  high  performance  regardless  of  the 
sparsity  of  tlie  matrix.  Note,  however,  that  as  the  matrix  density  increases,  the  Cray  Fortran  performance 
improves.  Eventually,  Fortran  achieves  superior  performance  because  of  Nesl’s  extra  per-element  cost  of 
interpretation  relative  to  compilation. 


Interpretive  overhead:  The  main  source  of  inefficiency  in  our  system  is  the  interpretation  of  the  VCODE 
generated  by  the  Nesl  compiler.  The  cost  of  int^retation  can  be  analyzed  by  studying  the  line-fitting 
benchmark  since  this  benchmark  requires  very  little  communication  and  the  native-code  implementations 
compile  to  almost  perfect  code. 

There  are  two  main  sources  of  interpretive  overhead  in  our  system.  First,  there  is  the  cost  of  executing 
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Figure  13:  Running  times  of  the  sparse-matrix  vector  product  for  varying  levels  of  sparsity.  The 
number  of  nonzero  entries  in  each  sparse  matrix  is  fixed  at  10*^. 

the  interpreter  itself.  For  the  line-fittiag  benchmark,  this  is  constant,  independent  of  input  size  (since  the 
interpreter  executes  a  fixed  number  of  VcoDE  steps),  and  so  it  can  be  computed  by  examining  the  running 
times  for  small  input  Figure  14  shows  the  percentage  of  run  time  accounted  for  by  this  ov^head  for  varying 
input  sizes,  as  well  as  the  /t  1/2  value  at  which  the  implementations  attain  half  of  their  asymptotic  efficiency. 
As  the  figure  shows,  NESL  sometimes  requires  fairly  large  input  in  order  to  attain  close  to  its  peak  efficiency. 
This  overhead  is  not  a  problem  on  the  CM-2;  since  there  are  32K  processors,  the  loss  of  efficiency  when 
working  with  small  vectors  (n  <  32K)  overwhelms  the  interpretive  overhead. 

The  second  major  deficiency  of  an  interpreter-based  system  is  that  the  granularity  of  the  operations 
performed  by  the  library  is  too  fine.  Each  operation  on  a  collection  of  data  is  performed  by  a  distinct  call 
to  the  CVL  library.  In  a  compiled  system,  the  loops  performing  the  separate  parallel  operations  could  be 
fused  together:  This  optimization  would  result  in  much  better  memory  locality  (quantities  could  be  kept 
in  registers  and  reused,  instead  of  being  loaded  from  memory,  acted  on,  and  written  back)  and  would  also 
allow  chaining  on  the  Cray.  These  loop  fusion  operations  are  performed  by  the  VcoDE  compiler  [  19,  20]. 
With  the  interpreter  these  inefficiencies  adversely  affect  the  peak  performance  of  Nesl  programs,  and  their 
effects  can  be  seen  in  the  performance  of  line-fitting  for  large  data  sizes  (see  Table  1).  On  the  CM-2  there 
is  an  additional  important  source  of  inefficiency:  CM-2  CVL  is  built  on  top  of  the  Paris  instruction  set  [57], 
Although  working  with  Paris  has  many  advantages,  it  forces  use  of  the  older  “fieldwise”  representation  of 
data,  instead  of  the  more  efficient  “slicewise”  representation  generated  by  the  CM  Fortran  compiler. 

Dynamic  load  balancing:  We  now  consider  why  the  native  code  for  the  median  algorithm  does  poorly 
compared  to  the  NESL  code  on  the  CM-2.  The  median  algorithm  reduces  the  number  of  active  elements 
on  each  stqi.  In  our  CM  Fortran  implementation,  as  these  elements  get  packed  to  the  bottom  of  an  array, 
they  become  more  imbalanced  across  the  processors.  Although  it  is  possible  to  pack  the  elements  into 
a  smaller  array,  this  would  require  dynamically  allocating  a  new  vector  on  each  step,  which  is  awkward 
in  CM  Fortranln  Nesl,  vectors  are  dynamically  allocated  with  the  data  automatically  balanced  across  the 
processors.  The  Nesl  implementation  of  the  median  algorithm  only  requires  a  total  of  0(n)  work  because 
on  each  of  the  0(logn)  steps  the  amount  of  data  processed  is  cut  by  a  constant  factor.  Since  the  CM  Fortran 
implementation  requires  0(n)  work  on  each  step,  it  is  a  factor  of  (7(logn)  slower,  as  illustrated  in  Figure  15. 
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Figure  14:  Interpreter  overtiead  for  the  line-fitting  benchmark.  The  vertical  lines  indicate  the  points 
at  which  overhead  accounts  for  50%  of  the  running  time.  The  percentage  overhead  for  the  CM-2 
Nesl  implementation  is  comparable  to  that  for  the  CM  Fortran  implementation.  The  Cray  Fortran 
overhead  is  insignificant  for  the  data  sizes  in  the  graph  and  is  not  shown. 

6  Comparison  to  Other  Systems 

Numerous  flat  data-parallel  languages  have  been  proposed  for  portable  parallel  programming,  such  as 
C*  [49],  MPP-Pascal  [7],  *Lisp  [39],  UC  [5],  and  Fortran  90  [2].  Section  2  explained  some  of  the 
expiessibility  and  efficiency  limitations  imposed  by  flat  languages;  these  problems  are  also  discussed 
elsewhere  [10, 11. 29, 32, 54]. 

Twoexistingparallel  languages  permit  the  user  to  describe  nested  data-parallel  operations;  CM-Lisp  [63] 
and  Paralation  Lisp  [51].  However,  the  implementations  of  these  languages  only  exploit  the  bottom  level  of 
parallelism;  for  the  sparse-matrix  example,  this  results  in  a  parallel  sum  for  each  row  and  a  serial  loop  over 
the  rows.  Both  of  these  languages  are  data-parallel  extensions  to  Common  Lisp  [55].  The  large  number  of 
features  in  Common  Lisp,  and  the  difficulty  of  extending  their  semantics  to  parallel  execution,  preclude  the 
implementation  of  full  nested  data  parallelism.  This  is  strong  motivation  for  a  simple  core  language. 

The  parallel  languages  ID  [43],  SISAL  [40],  and  Crystal  [22],  although  not  explicitly  data-parallel,  do 
support  fine-grained  parallelism.  They  also  support  nested  data  structures,  although  there  has  been  little 
research  on  implementing  nested  parallelism  for  these  languages.  There  are  also  several  serial  languages 
that  supply  data-paralld  primitives  and  nested  structures;  these  include  SETL  [52],  APL2  [37],  J  [36], 
and  FP  [4].  Sipelsteln  and  Blelloch  [54]  discuss  these  languages  from  the  perspective  of  supporting  data 
parallelism. 

Another  ^}proach  to  architecture-independent  parallel  programming  is  based  on  control-parallel  lan¬ 
guages  that  provide  asynchronous  communicating  serial  processes.  Examples  include  CSP  [35],  Linda  [18], 
Actors  [1],  and  PVM  [56].  These  languages  are  well-suited  for  problems  (including  irregular  problems)  that 
can  be  specified  in  terms  of  coarse-grained  sub-tasks.  Unfortunately,  high  implementation  overhead  makes 
efficiency  too  dependent  on  finding  a  decomposition  into  reasonably  sized  blocks  [18].  As  a  result,  these 
systems  are  not  well-suited  for  exploiting  fine-grained  parallelism.  The  large  grain  size  renders  programs 
less  likely  to  be  efficient  on  most  parallel  supercomputers  because  they  will  not  vectorize  well  and  do  not 
expose  enough  parallelism  to  take  advantage  of  large  numbers  of  processors.  Extending  these  models  to 
capture  fine-grained  parallelism  is  an  area  of  active  research  [23]. 
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Figure  15;  CM-2  median:  NEsi  vs.  CM  Fortran. 


7  Conclusions 

The  purpose  of  nested  data-parallel  languages  is  to  provide  the  advantages  of  data  parallelism  while  ex¬ 
tending  their  :q)plicability  to  algorithms  that  use  “irreguiar”  data  structures.  The  main  advantages  of  data 
parallelism  that  should  be  preserved  are  the  efficient  implementation  of  fine-grained  parallelism  and  the 
simple  synchronous  programming  model. 

We  have  described  the  implementation  of  a  nested  data-parallel  language  called  NESL.  Nesl  was 
designed  to  allow  the  concise  description  of  parallel  algorithms  on  both  structured  and  unstructured  data.  It 
has  been  used  in  acourse  on  parallel  algorithms  and  has  allowed  students  to  quickly  implement  a  wide  variety 
of  programs,  including  systems  for  speech  recognition,  ray-tracing,  volume  rendering,  parsing,  maximum- 
flow,  singular  value  decomposition,  mesh  partitioning,  pattern  matching,  and  big-number  arithmetic  [16]. 
(A  lull  implementation  of  Nesl  is  available  from  blelloch@cs .  emu .  edu.) 

The  benchmark  results  in  this  paper  show  that  it  is  possibie  to  achieve  good  efficiency  with  a  nested  data- 
parallel  language,  across  a  variety  of  parallel  machines.  Nesl  runs  within  a  local  interactive  environment 
that  allows  the  user  to  execute  programs  remotely  on  any  of  the  supported  architectures.  This  portability 
depends  crucially  on  the  organization  of  the  system  and  the  use  of  an  intermediate  language. 

The  efficiency  of  Nesl  on  large  triplications  still  requires  further  study.  Some  other  issues  that  we  plan 
to  examine:  getting  better  efficiency  on  nested  parallel  code  with  many  conditionals;  the  specification  of 
data  layout  for  irregular  structures;  tools  for  profiling  nested  parallel  code;  the  interaction  of  higher-order 
functions  with  nested  parallelism;  and  porting  the  system  to  other  architectures. 
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A  C,  Fortran,  and  CM  Fortran  Benchmark  Code 

C  code  for  line  fitting 


void  flC(double  x(i,  double  y(l.  double  ‘a,  double  ‘b, 
double  *slga,  double  *slgb,  int  n) 

{ 

Inc  1; 

double  chl2,- 
doubla  xa  =  0.0; 
double  ya  =  0.0; 
double  see  2  0.0; 

•b  =  0.0; 

for  (1  =  0;  1  <  n;  1++)  { 
xa  x(i ] ; 
ya  +3  y[l); 

} 

xa  3  xa/n; 
ya  3  ya/n; 

for  (1  3  0;  1  <  n;  !♦♦)  { 
double  emp  3  xtl]  -  xa; 
sec  +3  emp  •  emp; 

*b  +3  Cnp  *  y ( 1 1 ; 

} 

*b  3  ‘b/SCC; 

•a  3  (ya  -  xa  *  'b) ; 
chi2  3  0; 

for  (130;  1  <  n;  1++)  { 

double  V  3  y[il  -  *a  -  *b  '  x(i); 
chi2  +3  V  *  v; 

} 

•siga  3  sqrc((1.0/n  +  xa*xa/Stt) ’chi2/ (n-2 . 0) ) ; 

•slgb  3  sqrc ( { 1 . 0/Scc) •chi2 / (n-2 . 0) ) ; 

} 


21 


Fortran  code  for  line  fitting 


subroutine  fit(x,  y,  a,  b,  slga,  slgb,  n) 
iR^llcit  real  (a-h,  o-z) 
real  x(n),  y{n) 

xa  =  0.0 
ya  =  0.0 
do  10  j  =  1,  n 
xa  =  xa  *  x(j) 
ya  =  ya  +  y(j) 

10  continue 

xa  s  xa  /  n 
ya  =  ya  /  n 
Stt  =0.0 
b  =  0.0 

do  30  j  =  1,  n 
tnq?  =  x(j)  -  xa 
Stt  =  Stt  +  tmp  •  trap 
b  =  b  ♦  tmp  *  y(j) 

30  continue 

b  =  b  /  Stt 
a  =  ya  -  xa*b 
chi2  =0.0 
do  50  j  =  1,  n 

chi2  =  chi2  (y(j)  -  a  -  b*x(j))**2 
50  continue 

siga  =  sqrt((1.0/n  ♦  xa'xa/Stt) *chl2/{n-2.0)) 

sigb  =  3qrt( (1.0/Stt) *chi2/(n-2.0) ) 

return 

end 

CM  Fortran  code  for  line  fitting 

subroutine  fit{x,  y,  a,  b,  siga,  sigb,  n) 
Implicit  double  precision  (a-h,  o-z) 
double  precision  x(n),  y(n) 

xa  =  sum(x) /n 

ya  =  sum(y)/n 

Stt  =  sum( (X  -  xa) **2) 

b  =  sum((x  -  xa)*y)/Stt 

a  =  ya  -  xa'b 

chi2  =  sum((y  -  a  -  b*x)**2) 

siga  =  sqrt((1.0/n  +  xa'xa/Stc) *chi2/ (n-2.0) ) 

sigb  =  sqrt ( ( 1 . 0/Stt) •chi2/ (n-2 . 0) ) 

return 

end 
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C  code  for  selection 


Int  select (inc  sf],  int  Ic,  inc  1,  inc  *tos) 

{ 

Int  1; 

InC  pivot  =  a [1/21; 
int  lesser  =  0; 

for  (i  =0;  i  <  1;  i++)  if  (s[i]  <  pivot)  leaser++: 
if  (k  <  lesser)  { 
int  j  =  0; 

for  (i  =  0;  i  <  1;  i++)  if  (s(il  <  pivot)  tos(j++l  =  s(il; 
return  select (toa,  k,  leaser,  tos  +  lesser); 

} 

else  { 

int  greater  =  0; 

for  <i  =  0;  i  <  1;  it+)  if  (sli)  >  pivot)  greater++; 
if  (k  >=  1  -  greater)  ( 
int  J  =  0; 

for  (i  =  0;  i  <  1;  i  +  +  )  if  (s(il  >  pivot)  tossljf-f)  =  sli); 
return  aelect(tos,  k  -  (1  -  greater),  greater,  tos  ♦  greater) 

} 

else  return  pivot; 

} 

} 
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Fortran  code  for  selection 


integer  function  select (t,  s,  d,  kk,  11) 
implicit  integer  (a-z) 

integer  s (11) ,  C(ll),  d(ll),  pivot,  greater,  lesser 
k  s  kk 
1  =  11 
do  i  =  1,  1 

3(i)  =  d(i) 

end  do 

5  continue 

pivot  =  3(1/2  +  1) 
le33er  =  0 
do  i  =  1,  1 

if  (s{i)  .It.  pivot)  lesser  =  lesser  ♦  1 
end  do 

if  (k  .It.  lesser)  Chen 
j  =  0 

do  i  =  1,  1 

if  (s(i)  .It.  pivot)  Chen 

j  =  j  +  1 
t(j)  =  s(i) 
endif 
end  do 

do  1  =  1,  lesser 
s(i)  =  t(i) 
end  do 
1  =  lesser 
else 

greater  =  0 
do  i  =  1,  1 

if  (s(i)  .gt.  pivot)  greater  =  greater  ♦  1 
end  do 

if  (k  .ge.  1  -  greater)  then 
j  =  0 

do  i  =  1,  1 

if  (s(i)  .gt.  pivot)  then 
j  =  j  t  1 
t(j)  =  s(i) 
endif 
end  do 

do  i  =  1,  greater 
s(i)  =  t(i) 
end  do 

k  =  k  -  (1  -  greater) 

1  -  greater 
else 
1  =  0 
endif 
endif 

if  (1  .gt.  0)  goto  5 
select  =  pivot 
return 
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CM  Fortran  code  for  selection 


lnc«g«r  function  select (src,  k,  n) 
liqpllclt  integer  (a-z) 
integer  3rc(n),  s(n) 

logical  greater(n\,  lesser (n)  ,  tnasktn) 

s  =  src 
on.count  =  n 
target  =  k 

do  while  (orucount  .gt.  0) 
pivot  =  s( or -count/ 2  +  1) 
forall  (i=l:n)  ma3k(l)  =  i  .le.  on.count 
lesser  ^  s  .It.  pivot  .and.  mask 
nUesser  =  count  (lesser) 
if  (target  .It.  n.lesser)  then 
3  =  pack(s,  lesser) 
on.coQnC  =  nUesser 
else 

greater  =  s  .gt.  pivot  .and.  mask 
n.greater  =  count (greater) 
if  (target  .ge.  on-Count  -  n.greater)  then 
3  =  pack(s,  greater) 

target  =  target  -  (on.count  -  n.greater) 
on.count  =  n.greater 
else 

on-count  =  0 
endif 
endif 
end  do 

select  =  pivot 

return 

end 

C  code  for  sparse-matrix  vector  product 


void  MxV(double  Result (1/  double  MvalJ},  int  Midxl),  a 


double  Vect(l,  int  nrows) 

{ 

int  ncols; 
double  sum; 

whlle  (nrows--)  { 
sum  =  0.0; 
ncols  =  *Mlent+; 
while  (ncols--)  sum  += 
•Result* +  =  sum; 

} 

} 


*Mval++  *  * (Vect  ♦  *Midx**) ; 


Mien (  ) , 
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Fortran  code  for  sparse<inatrix  vector  product 


subroutine  MxV(Result,  Hval,  Mldx,  Mien,  Vecc,  nrows) 

Integer  nrows 

Integer  Ml dx(*) ,  Mlen(*) 

real  Result (*),  MvaK*),  Vect(»),  sum 

Index  =  1 

do  Irow  =  1,  nrows 
sum  =  0.0 

do  Icol  =  1,  Mien (Irow) 

sum  =  sum  +  Mval( index)  '  Vect (Mldx (index) ) 
Index  =  Index  +  1 
end  do 

Result (Irow)  =  sum 
end  do 
return 
end 

CM  Fortran  code  for  sparse-matrix  vector  product 


subroutine  Setup (Mien,  Offset,  Segment,  nrows,  n) 
Include  '  /usr/include/cm/CMF.defs .  h ' 
integer  nrows,  n,  Mlen(nrows),  Offset (nrows) 
logical  Segment (n) 

call  CMF-SCANJU3D (Offset,  Mien,  CMFJflJLL,  1, 

$  CMF.OPWARD,  CMFJIXCLUSIVE,  CMFJdONE,  .TRUE.) 
Offset  =  Offset  ♦  1 

Segment  =  .FALSE. 

Segment (Of f set)  =  .TRUE. 

return 

end 


subroutine  MxV(ResulC, Mval , Midx, Offset, Segment . Vect , nrows, n) 
Include  ' /usr/include/cm/CMF.defs. h' 
integer  nrows,  n,  Midx(n),  Offset (nrows) 
logical  Segment (n) 

real  Mval(n),  ScanMe(n),  Result (nrows) ,  Vecc(nrows) 

ScanMe  =  Mval  *  Vect (Midx) 

call  CMF.SCANJiDD( ScanMe,  ScanMe,  Segment,  1, 

$  CMFJXWNWARD,  CMF-INCLUSIVE,  CMF-SEGMENT.BIT,  .TRUE.) 
Result  =  ScanMe (Offset) 
return 
end 
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